
Executive Summary¶

This study seeks to identify the key features that describe a habitable exoplanet through Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Non-Negative Matrix Factorization (NMF). A concise version of the methodology is as follows:

  1. Data collection via Application Programming Interface (API) calls to https://exoplanetarchive.ipac.caltech.edu
  2. Data pre-processing
  3. PCA on the whole dataset and on each of the exoplanet dispositions (confirmed, candidate, and false positive) to determine the key PCs/features that may arise
  4. Analysis of all the findings from PCA, SVD, and NMF

Our results show that confirmed exoplanets have fundamentally different important features from those of candidate and false positive exoplanets.

Introduction¶

Have you ever wondered if there is life outside of our planet? Are there any Earth-like planets beyond our solar system that could support life? So far, the only life that we know of is here on our own planet, Earth.

Plastics break down into microplastics that are now found in riverbanks, glaciers, and even inside fish. About 75% of the planet's surface is covered by bodies of water, yet ocean cleanup initiatives remain scarce. With a looming deadline for irreversible damage from global warming, and with many having given up hope for this planet, is it time to find another and start over?

Finding a planet that is suitable for human habitation, or at least has the possibility of supporting life on its surface, comes with a set of criteria, among them stellar distance, planetary size, and composition. A candidate planet should be within the habitable zone, or Goldilocks zone: close enough to its star to absorb sufficient heat (access to stellar energy), yet far enough away to keep liquid water on its surface.

Back in 2009, NASA's Kepler Space Telescope was launched to detect exoplanets that could be habitable for humans. Since then, thousands of planets outside of our solar system have been discovered. The telescope uses the transit method, which detects planets whose orbits are seen edge-on from Earth whenever they cross the line of sight between their star and Earth. These crossings, or transits, cause a periodic dimming of the star's light, which is monitored by Kepler's photometer.

This initial mission ended in 2013 due to mechanical issues, but since the telescope was still functional, the mission was extended and renamed K2.

Transit Method

Bright glare from a star can hinder the detection of its exoplanets, so direct detection through telescopes is not reliable. Astronomers therefore developed a way to detect these objects of interest indirectly: look at the effects the exoplanets have as they orbit their star. As a planet passes in front of the star (relative to the telescope), it temporarily blocks some of the star's light, so the star appears dimmer for a particular time period. From the amount of dimming, we can infer the size of the planet.
To demonstrate, imagine you have a flashlight pointed at a wall without any obstruction; now place an object, say a ball, between the flashlight and the wall. You will notice that the light projected onto the wall changes depending on the ball's size, material, and distance from the flashlight. These observations then become data from which to infer the ball's characteristics.
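The geometry of the analogy can be made quantitative. To first order, the dimming fraction (the transit depth) is the ratio of the planet's and star's disc areas, (Rp/Rs)². The sketch below, with a hypothetical helper name, applies that relation, ignoring real-world complications such as limb darkening and grazing transits:

```python
import math

def planet_radius_from_depth(depth_ppm, stellar_radius_solar):
    """Estimate a planet's radius (in Earth radii) from transit depth.

    Assumes the simple geometric relation depth ~= (Rp / Rs)^2,
    ignoring limb darkening and grazing transits.
    """
    R_SUN_IN_EARTH = 109.2                 # solar radius in Earth radii (approx.)
    ratio = math.sqrt(depth_ppm / 1e6)     # Rp / Rs from the fractional dimming
    return ratio * stellar_radius_solar * R_SUN_IN_EARTH

# A Jupiter-size planet (~11 Earth radii) transiting a Sun-like star
# blocks roughly 1% of the light (10,000 ppm):
print(round(planet_radius_from_depth(10_000, 1.0), 1))  # → 10.9
```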

Until its decommissioning in 2018, Kepler sent back volumes of data, which are now available to the public through the NASA Exoplanet Archive of the NASA Exoplanet Science Institute.

We clustered exoplanets using factor analysis tools such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Non-Negative Matrix Factorization (NMF).

Data Description¶

This study used data from the NASA Exoplanet Archive, an online catalog and data service for astronomical data on stars and exoplanets that is reviewed and evaluated by a team of astronomers. It collates and cross-correlates astronomical data and information on exoplanets and their host stars, and provides services to work with these data.

Among the services that this archive provides is an Application Programming Interface (API) which can return data about exoplanets. This study has used this service to retrieve the Cumulative Kepler Objects of Interest Table.

The dataset uses the prefix KOI, which stands for Kepler Object of Interest. The variables of interest are as follows:

  • koi_period (Orbital Period): Interval between consecutive planetary transits
  • koi_time0bk (Transit Epoch): The time corresponding to the center of the first detected transit in Barycentric Julian Day (BJD) minus a constant offset of 2,454,833.0 days. The offset corresponds to 12:00 on Jan 1, 2009 UTC
  • koi_impact (Impact Parameter): The sky-projected distance between the center of the stellar disc and the center of the planet disc at conjunction, normalized by the stellar radius
  • koi_duration (Transit Duration): Duration of the observed transits. Duration is measured from first contact between the planet and star until last contact
  • koi_depth (Transit Depth): The fraction of stellar flux lost at the minimum of the planetary transit
  • koi_prad (Planetary Radius): Radius of the planet. Planetary radius is the product of the planet star radius ratio and the stellar radius
  • koi_teq (Equilibrium temperature): Approximation for the temperature of the planet. The calculation of equilibrium temperature assumes thermodynamic equilibrium between the incident stellar flux and the radiated heat from the planet
  • koi_insol (Insolation Flux): Insolation flux is another way to give the equilibrium temperature. It depends on the stellar parameters (specifically the stellar radius and temperature), and on the semi-major axis of the planet
  • koi_model_snr (Transit Signal-to-Noise Ratio): Transit depth normalized by the mean uncertainty in the flux during the transits
  • koi_tce_plnt_num (TCE Planet Number): TCE Planet Number federated to the KOI
  • koi_steff (Stellar Effective Temperature): Photospheric temperature of the star
  • koi_srad (Stellar Radius): Photospheric radius of the star
  • koi_kepmag (Kepler-band Magnitude): Brightness of the star in the Kepler bandpass, in magnitudes

The whole dataset contains 9,565 observations (rows) and 50 features.
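Two of the columns above are tied together by a known physical relation: koi_teq follows from the star's temperature and radius and the orbit's semi-major axis. A minimal sketch of the textbook zero-albedo formula (the archive's own values may use different albedo assumptions):

```python
import math

def equilibrium_temperature(t_star_k, r_star_m, semi_major_axis_m, albedo=0.0):
    """Textbook equilibrium temperature of a planet (zero-albedo default).

    Assumes thermodynamic equilibrium between absorbed stellar flux and
    re-radiated heat, as in the koi_teq column description:
        T_eq = T_star * sqrt(R_star / (2 a)) * (1 - albedo) ** 0.25
    """
    return (t_star_k * math.sqrt(r_star_m / (2 * semi_major_axis_m))
            * (1 - albedo) ** 0.25)

AU = 1.496e11      # metres
R_SUN = 6.957e8    # metres

# Earth around the Sun with zero albedo comes out near 278 K:
print(round(equilibrium_temperature(5772, R_SUN, AU)))  # → 278
```

The same quantities drive koi_insol, which is why the two columns carry largely redundant information (see the pairplot discussion below in the EDA).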

Methodology and Data Processing¶

The process of determining similar planets began with data collection from https://exoplanetarchive.ipac.caltech.edu through Application Programming Interface (API) calls. The resulting dataset was then split by disposition into confirmed, candidate, and false positive exoplanets.

PCA was done using the whole dataset as well as one for each of the previously mentioned dispositions. Performing PCA on the different subclasses or dispositions provided the team with further insights on the relevant features describing each.

An 80% cumulative variance explained threshold was used to select a consistent number of components across all the subclasses or dispositions. Similar to PCA, SVD was done to check whether we could reduce the number of singular values at the same 80% cumulative variance explained.
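The threshold rule above can be sketched in a few lines: take the cumulative sum of the explained variance ratios and find the first index at which it crosses the threshold (the helper name is ours):

```python
import numpy as np

def n_components_for_threshold(explained_variance_ratio, threshold=0.80):
    """Smallest number of components whose cumulative explained
    variance ratio reaches the given threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # argmax returns the first True position in the boolean mask
    return int(np.argmax(cumulative >= threshold)) + 1

# Toy spectrum: the first two components already carry 85% of the variance.
print(n_components_for_threshold([0.6, 0.25, 0.1, 0.05]))  # → 2
```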

To determine the latent factors critical for clustering the data, NMF was performed. This process checks the effectiveness of latent-factor clusters at identifying "confirmed" exoplanets.

Exploratory Data Analysis¶

In [1]:
import pandas as pd
import numpy as np
import requests

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA

The dataset originally contains 50 features, which we reduced to 13 for interpretability. Most of the other features are error measurements, such as koi_impact_err1 and koi_prad_err2, or irrelevant identifiers such as kepler_name and kepid. Using the error measurements might hurt the interpretability of our analysis, so we dropped these features.
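The pruning logic can be expressed compactly in pandas; the drop list in the cell below enumerates columns explicitly, but the same effect can be sketched with suffix matching (drop_uninformative is a hypothetical helper, not part of the notebook's pipeline):

```python
import pandas as pd

def drop_uninformative(df):
    """Drop per-measurement error columns (suffixed _err1/_err2) and
    identifier columns that do not describe the physics of a KOI.
    Column names follow the Kepler cumulative table conventions."""
    id_cols = [c for c in ('kepid', 'kepoi_name', 'kepler_name') if c in df.columns]
    err_cols = [c for c in df.columns if c.endswith(('_err1', '_err2'))]
    return df.drop(columns=id_cols + err_cols)

toy = pd.DataFrame({'kepid': [1], 'koi_prad': [2.3], 'koi_prad_err1': [0.1]})
print(list(drop_uninformative(toy).columns))  # → ['koi_prad']
```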


In [2]:
def _to_float(elem):
    try:
        return float(elem)
    except (TypeError, ValueError):
        return np.nan

def _map_disp(elem, _map={"FALSE POSITIVE": 0, "CANDIDATE": 1, "CONFIRMED": 2}):
    try:
        return _map[elem]
    except KeyError:
        return np.nan

def _get_data(split_disposition=False):
    """
    split_disposition: bool
        - splits original df into 3 based on koi_disposition. returns all 4 dfs
            (3 split dfs + original)
    """
    col_drop = ['kepid', 'kepoi_name', 'kepler_name',
       'koi_pdisposition', 'koi_score', 'koi_fpflag_nt', 'koi_fpflag_ss',
       'koi_fpflag_co', 'koi_fpflag_ec', 'koi_period_err1',
       'koi_period_err2', 'koi_time0bk_err1',
       'koi_time0bk_err2', 'koi_impact_err1', 'koi_impact_err2',
       'koi_duration_err1', 'koi_duration_err2', 'koi_tce_delivname',
       'koi_depth_err1', 'koi_depth_err2', 'koi_prad_err1',
       'koi_prad_err2', 'koi_teq_err1', 'koi_teq_err2',
       'koi_insol_err1', 'koi_insol_err2', 'koi_steff_err1', 'koi_steff_err2',
       'koi_slogg', 'koi_slogg_err1', 'koi_slogg_err2',
       'koi_srad_err1', 'koi_srad_err2', 'ra_str', 'dec_str', 'koi_kepmag_err']
    url = 'https://exoplanetarchive.ipac.caltech.edu/cgi-bin/nstedAPI/nph-nstedAPI?table=cumulative'
    # The API returns CSV text; the first row holds the column names.
    res = requests.get(url).text.split('\n')
    df = pd.DataFrame([elem.split(',') for elem in res])
    df.columns = df.iloc[0]
    df = df[1:].drop(col_drop, axis=1)
    df.koi_disposition = df.koi_disposition.apply(_map_disp)
    df = df.apply(lambda col: col.apply(_to_float)).dropna()
    if not split_disposition:
        return df
    fp_df = df.query('koi_disposition == 0')
    can_df = df.query('koi_disposition == 1')
    con_df = df.query('koi_disposition == 2')
    return fp_df, can_df, con_df, df

fp_df, can_df, con_df, all_df = _get_data(split_disposition=True)
display(fp_df)
display(can_df)
display(con_df)
display(all_df)
koi_disposition koi_period koi_time0bk koi_impact koi_duration koi_depth koi_prad koi_teq koi_insol koi_model_snr koi_tce_plnt_num koi_steff koi_srad koi_kepmag
4 0.0 1.736952 170.307565 1.276 2.40641 8079.2 33.46 1395.0 891.96 505.6 1.0 5805.0 0.791 15.597
9 0.0 7.361790 132.250530 1.169 5.02200 233.7 39.21 1342.0 767.22 47.7 1.0 6227.0 1.958 12.660
15 0.0 11.521446 170.839688 2.483 3.63990 17984.3 150.51 753.0 75.88 622.1 1.0 5795.0 0.848 15.472
16 0.0 19.403938 172.484253 0.804 12.21550 8918.7 7.18 523.0 17.69 214.7 1.0 5043.0 0.680 15.487
17 0.0 19.221389 184.552164 1.065 4.79843 74284.0 49.29 698.0 55.97 2317.0 1.0 6117.0 0.947 15.341
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9558 0.0 373.893980 261.496800 0.963 27.66000 730.0 2.51 206.0 0.42 18.5 3.0 5263.0 0.699 14.911
9559 0.0 8.589871 132.016100 0.765 4.80600 87.7 1.11 929.0 176.40 8.4 1.0 5638.0 1.088 14.478
9560 0.0 0.527699 131.705093 1.252 3.22210 1579.2 29.35 2088.0 4500.53 453.3 1.0 5638.0 0.903 14.082
9562 0.0 0.681402 132.181750 0.147 0.86500 103.6 1.07 2218.0 5713.41 12.3 1.0 6173.0 1.041 15.385
9564 0.0 4.856035 135.993300 0.134 3.07800 76.7 1.05 1266.0 607.42 8.2 1.0 6469.0 1.193 14.826

4381 rows × 14 columns

koi_disposition koi_period koi_time0bk koi_impact koi_duration koi_depth koi_prad koi_teq koi_insol koi_model_snr koi_tce_plnt_num koi_steff koi_srad koi_kepmag
3 1.0 19.899140 175.850252 0.969 1.78220 10829.0 14.60 638.0 39.30 76.3 1.0 5853.0 0.868 15.436
38 1.0 4.959319 172.258529 0.831 2.22739 9802.0 12.21 1103.0 349.40 696.5 1.0 5712.0 1.082 15.263
59 1.0 40.419504 173.564690 0.911 3.36200 6256.0 7.51 467.0 11.29 36.9 1.0 5446.0 0.781 15.487
63 1.0 7.240661 137.755450 1.198 0.55800 556.4 19.45 734.0 68.63 13.7 2.0 5005.0 0.765 15.334
64 1.0 3.435916 132.662400 0.624 3.13300 23.2 0.55 1272.0 617.61 8.7 3.0 5779.0 1.087 12.791
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9539 1.0 7.268182 135.934800 0.780 4.98500 46.7 1.66 1444.0 1027.95 9.7 1.0 6297.0 2.219 13.729
9543 1.0 376.379890 486.602200 0.305 13.99000 1140.0 3.26 265.0 1.16 13.3 1.0 6231.0 0.955 15.632
9553 1.0 367.947848 416.209980 0.902 4.24900 1301.0 3.72 228.0 0.64 10.7 1.0 5570.0 0.855 15.719
9561 1.0 1.739849 133.001270 0.043 3.11400 48.5 0.72 1608.0 1585.81 10.6 1.0 6119.0 1.031 14.757
9563 1.0 333.486169 153.615010 0.214 3.19900 639.1 19.30 557.0 22.68 14.0 1.0 4989.0 7.824 10.998

1905 rows × 14 columns

koi_disposition koi_period koi_time0bk koi_impact koi_duration koi_depth koi_prad koi_teq koi_insol koi_model_snr koi_tce_plnt_num koi_steff koi_srad koi_kepmag
1 2.0 9.488036 170.53875 0.146 2.9575 615.8 2.26 793.0 93.59 35.8 1.0 5455.0 0.927 15.347
2 2.0 54.418383 162.51384 0.586 4.5070 874.8 2.83 443.0 9.11 25.8 2.0 5455.0 0.927 15.347
5 2.0 2.525592 171.59555 0.701 1.6545 603.3 2.75 1406.0 926.16 40.9 1.0 6031.0 1.046 15.509
6 2.0 11.094321 171.20116 0.538 4.5945 1517.5 3.90 835.0 114.81 66.5 1.0 6046.0 0.972 15.714
7 2.0 4.134435 172.97937 0.762 3.1402 686.0 2.77 1160.0 427.65 40.2 2.0 6046.0 0.972 15.714
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8818 2.0 4.485592 135.43172 0.442 0.8161 2265.0 0.95 365.0 4.20 15.0 1.0 3236.0 0.193 15.737
8957 2.0 8.152759 134.19046 0.461 1.7460 16536.0 2.44 305.0 2.03 9.2 3.0 3327.0 0.189 17.475
9015 2.0 384.847556 314.97000 0.059 9.9690 189.9 1.09 220.0 0.56 12.3 1.0 5579.0 0.798 13.426
9084 2.0 3.875943 134.84758 0.025 2.3140 58.6 0.68 1081.0 323.21 14.2 1.0 5713.0 0.893 12.750
9182 2.0 24.278380 154.51325 0.717 4.5640 714.4 2.03 432.0 8.27 13.6 6.0 4450.0 0.707 14.164

2659 rows × 14 columns

koi_disposition koi_period koi_time0bk koi_impact koi_duration koi_depth koi_prad koi_teq koi_insol koi_model_snr koi_tce_plnt_num koi_steff koi_srad koi_kepmag
1 2.0 9.488036 170.538750 0.146 2.95750 615.8 2.26 793.0 93.59 35.8 1.0 5455.0 0.927 15.347
2 2.0 54.418383 162.513840 0.586 4.50700 874.8 2.83 443.0 9.11 25.8 2.0 5455.0 0.927 15.347
3 1.0 19.899140 175.850252 0.969 1.78220 10829.0 14.60 638.0 39.30 76.3 1.0 5853.0 0.868 15.436
4 0.0 1.736952 170.307565 1.276 2.40641 8079.2 33.46 1395.0 891.96 505.6 1.0 5805.0 0.791 15.597
5 2.0 2.525592 171.595550 0.701 1.65450 603.3 2.75 1406.0 926.16 40.9 1.0 6031.0 1.046 15.509
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9560 0.0 0.527699 131.705093 1.252 3.22210 1579.2 29.35 2088.0 4500.53 453.3 1.0 5638.0 0.903 14.082
9561 1.0 1.739849 133.001270 0.043 3.11400 48.5 0.72 1608.0 1585.81 10.6 1.0 6119.0 1.031 14.757
9562 0.0 0.681402 132.181750 0.147 0.86500 103.6 1.07 2218.0 5713.41 12.3 1.0 6173.0 1.041 15.385
9563 1.0 333.486169 153.615010 0.214 3.19900 639.1 19.30 557.0 22.68 14.0 1.0 4989.0 7.824 10.998
9564 0.0 4.856035 135.993300 0.134 3.07800 76.7 1.05 1266.0 607.42 8.2 1.0 6469.0 1.193 14.826

8945 rows × 14 columns

In [3]:
sns.pairplot(all_df, hue="koi_disposition", diag_kind='kde')
pass

From the pairplots, we don't see any visible clustering in the scatter plots, but there is a clear correlation between koi_insol (Insolation Flux) and koi_teq (Equilibrium Temperature). This is expected, since both variables are different ways of expressing the planet's equilibrium temperature.
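That redundancy can be checked numerically: insolation scales with the fourth power of equilibrium temperature (Stefan-Boltzmann), so the fourth root of koi_insol should correlate almost perfectly with koi_teq. A sketch on synthetic stand-in data (not the archive itself; the 255 K reference and noise level are our assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
teq = rng.uniform(200, 2000, size=500)                        # stand-in temps (K)
insol = (teq / 255.0) ** 4 * (1 + rng.normal(0, 0.02, 500))   # noisy relative flux

df_toy = pd.DataFrame({'koi_teq': teq, 'koi_insol': insol})
# Pearson correlation between T_eq and insolation^(1/4) is close to 1:
print(df_toy['koi_teq'].corr(df_toy['koi_insol'] ** 0.25))
```

On the real table, a high correlation like this argues for keeping only one of the pair, or letting PCA fold them into a single component.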

Feature and Target Variable Assignment

In [4]:
# Features
_all_X = all_df.iloc[:, 1:]
_fp_X = fp_df.iloc[:, 1:]
_can_X = can_df.iloc[:, 1:]
_con_X = con_df.iloc[:, 1:]
In [5]:
# Target
_all_y = all_df.koi_disposition
_fp_y = fp_df.koi_disposition
_can_y = can_df.koi_disposition
_con_y = con_df.koi_disposition

Principal Component Analysis

We perform a PCA on the entire dataset to try to see whether there are observable patterns in their state space. In addition, we also perform PCA on the different dispositions to identify features that are important to each.

In [6]:
def pca(X, n=2):
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    pca = PCA(n)
    X_new = pca.fit_transform(X)
    return X_new, pca.components_.T, pca.explained_variance_ratio_
In [7]:
_all_new, w_all, variance_explained_all = pca(_all_X, n=None)
_fp_new, w_fp, variance_explained_fp = pca(_fp_X, n=None)
_can_new, w_can, variance_explained_can = pca(_can_X, n=None)
_con_new, w_con, variance_explained_con = pca(_con_X, n=None)
In [8]:
df = pd.DataFrame(data = _all_new[:, :2], 
                  columns = ['PC1', 'PC2'])

df.reset_index(drop=True, inplace=True)
_all_y.reset_index(drop=True, inplace=True)

result_df = pd.concat([df, _all_y], axis=1)
In [9]:
# Visualize Principal Components with a scatter plot
fig = plt.figure(figsize = (12,10))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel('First Principal Component ', fontsize = 15)
ax.set_ylabel('Second Principal Component ', fontsize = 15)
ax.set_title('Principal Component Analysis (2PCs) for ExoPlanet Dataset', fontsize = 20)

targets = [0, 1, 2]
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = all_df.koi_disposition == target
    ax.scatter(result_df.loc[indicesToKeep, 'PC1'], 
               result_df.loc[indicesToKeep, 'PC2'], 
               c = color, 
               s = 10)
ax.legend(['False Positive', 'Candidate', 'Confirmed'])

features = list(_all_X.columns)
for feature, vec in zip(features, w_all):
    plt.arrow(0, 0, 20*vec[0], 20*vec[1], width=0.1, ec='none', fc='r')
    plt.text(25*vec[0], 25*vec[1], feature, ha='center', color='r')


ax.grid()
In [10]:
plt.rcParams["figure.figsize"] = (15, 5)

fig1, ax1 = plt.subplots(1, 3)
ax1[0].scatter(_fp_new[:, 0], _fp_new[:, 1])
ax1[1].scatter(_can_new[:, 0], _can_new[:, 1])
ax1[2].scatter(_con_new[:, 0], _con_new[:, 1])
ax1[0].set_xlabel("PC1_fp")
ax1[1].set_xlabel("PC1_can")
ax1[2].set_xlabel("PC1_con")
ax1[0].set_ylabel("PC2_fp")
ax1[1].set_ylabel("PC2_can")
ax1[2].set_ylabel("PC2_con")
ax1[0].set_title("False Positive")
ax1[1].set_title("Candidate")
ax1[2].set_title("Confirmed")

for feature, vec in zip(features, w_fp):
    ax1[0].arrow(0, 0, 20*vec[0], 20*vec[1], width=0.1, ec='none', fc='r')
    ax1[0].text(25*vec[0], 25*vec[1], feature, ha='center', color='r')
for feature, vec in zip(features, w_can):
    ax1[1].arrow(0, 0, 20*vec[0], 20*vec[1], width=0.1, ec='none', fc='r')
    ax1[1].text(25*vec[0], 25*vec[1], feature, ha='center', color='r')
for feature, vec in zip(features, w_con):
    ax1[2].arrow(0, 0, 20*vec[0], 20*vec[1], width=0.1, ec='none', fc='r')
    ax1[2].text(25*vec[0], 25*vec[1], feature, ha='center', color='r')

The interpretation of first two PCs for each subclass is as follows:

  • False Positive

    • PC1 is related to high equilibrium temperatures and low transit durations. It is probably related to radiant power or radiant flux.
    • PC2 is related to high planet radius and impact parameter and low transit depth and SNR. This could possibly be explained by an irregular planet shape.
  • Candidate

    • PC1 is related to high equilibrium temperatures and low transit durations. It is probably related to radiant power or radiant flux.
    • PC2 is related to high planet radius and impact parameter and low transit depth and SNR. This could possibly be explained by an irregular planet shape.
  • Confirmed

    • PC1 is related to high equilibrium temperatures, but this time, it is not described by low transit durations. It is probably related to radiant intensity or simply just the temperature
    • PC2 is related to high transit durations, transit depth, and planet radius. This indicates a low planet velocity at least as seen from Earth. The relatively high SNR in this case might also suggest some regularity in the shape of the planet.

From our interpretations of the principal components, we see that the interpreted features for the confirmed exoplanets are quite different from those of the other two dispositions. This insight may be helpful in evaluating the status of candidate exoplanets. Round Earth FTW!
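Interpretations like the ones above come from reading off each PC's largest loadings. A minimal sketch of that step on synthetic data (the real runs above use the standardized KOI features; the coupling between the first two toy columns is our construction):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
toy_features = ['koi_teq', 'koi_duration', 'koi_depth', 'koi_prad']
X = rng.normal(size=(200, len(toy_features)))
X[:, 0] += 2 * X[:, 1]        # couple two columns so PC1 has a clear story

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pca_toy = PCA(n_components=2).fit(X_std)
loadings = pd.DataFrame(pca_toy.components_.T,
                        index=toy_features, columns=['PC1', 'PC2'])

# Report the two strongest loadings per component, by absolute value:
for pc in loadings.columns:
    top = loadings[pc].abs().sort_values(ascending=False).head(2)
    print(pc, '->', list(top.index))
```

PC1 picks out the two coupled columns, mirroring how high-magnitude loadings on koi_teq or koi_duration drove the readings above.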

In [11]:
plt.rcParams["figure.figsize"] = (15, 15)

curves = [(variance_explained_fp, "False Positive"),
          (variance_explained_can, "Candidate"),
          (variance_explained_con, "Confirmed"),
          (variance_explained_all, "All Dispositions")]

fig1, ax1 = plt.subplots(4, 1)
for ax, (ve, title) in zip(ax1, curves):
    ax.plot(range(1, len(ve) + 1), ve.cumsum(), 'o-')
    ax.axhline(0.9, ls='--', color='g')
    ax.axvline(9, ls='--', color='g')
    ax.set_ylim(0, 1)
    ax.set_ylabel('cumulative variance explained')
    ax.set_title(title)
plt.xlabel('number of PCs')
pass

Comparing the cumulative variance explained across all dispositions, we get about 90% of the cumulative variance explained with 9 PCs. This modest reduction in dimensionality suggests that each subset of the data carries substantial variance across multiple features.

Singular Value Decomposition

To supplement the insights from the other analyses, the team conducted SVD on the entire dataset and identified two prominent features that drive the first two singular vectors. The first singular vector is positively influenced by the insolation flux; the second is positively influenced by the transit depth, also known as the fraction of stellar flux lost. Their perpendicularity suggests independence from each other.
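The perpendicularity is not a coincidence of this dataset: the right singular vectors of any real matrix form an orthonormal set, which is what lets the loading arrows be read as independent directions. A quick sketch on a random stand-in matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 13))          # stand-in for the scaled KOI matrix

# numpy returns V transposed; its rows are the right singular vectors.
_, _, Vt = np.linalg.svd(A, full_matrices=False)

# V^T V should be the identity up to floating-point error:
print(np.allclose(Vt @ Vt.T, np.eye(13)))  # → True
```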

In [12]:
from sklearn.decomposition import TruncatedSVD
In [13]:
def Tsvd(X):
    svd = TruncatedSVD(n_components=13)
    svd.fit(X)
    cve = 0.80
    n_components = 1 + np.argmax(np.cumsum(svd.explained_variance_ratio_) >= cve)
    X_new = svd.transform(X)[:, :n_components]
    return X_new, svd.components_.T, svd.explained_variance_ratio_
In [14]:
_all_new_svd, w_all_svd, variance_explained_all_svd = Tsvd(all_df)
_fp_new_svd, w_fp_svd, variance_explained_fp_svd = Tsvd(fp_df)
_can_new_svd, w_can_svd, variance_explained_can_svd = Tsvd(can_df)
_con_new_svd, w_con_svd, variance_explained_con_svd = Tsvd(con_df)
In [15]:
fig, ax = plt.subplots(1, 2, subplot_kw=dict(aspect='equal'), 
                       gridspec_kw=dict(wspace=0.4), dpi=150)
ax[0].scatter(_all_new_svd[:, 0], _all_new_svd[:, 1])
ax[0].set_xlabel('SV1')
ax[0].set_ylabel('SV2')

for feature, vec in zip(features, w_all_svd):
    ax[1].arrow(0, 0, vec[0], vec[1], width=0.01, ec='none', fc='r')
    ax[1].text(vec[0], vec[1], feature, ha='center', color='r', fontsize=5)
ax[1].set_xlim(-1, 1)
ax[1].set_ylim(-1, 1)
ax[1].set_xlabel('SV1')
ax[1].set_ylabel('SV2')
Out[15]:
Text(0, 0.5, 'SV2')

Non-negative Matrix Factorization

In [27]:
X = np.array(_all_X)
In [28]:
from sklearn.decomposition import NMF

nmf = NMF()
U = nmf.fit_transform(X)
V = nmf.components_.T
In [29]:
fig, ax = plt.subplots()
ax.spy(V)
ax.set_xticks(range(len(features)))
ax.set_yticks(range(len(features)))
ax.set_yticklabels(features)
Out[29]:
[Text(0, 0, 'koi_period'),
 Text(0, 1, 'koi_time0bk'),
 Text(0, 2, 'koi_impact'),
 Text(0, 3, 'koi_duration'),
 Text(0, 4, 'koi_depth'),
 Text(0, 5, 'koi_prad'),
 Text(0, 6, 'koi_teq'),
 Text(0, 7, 'koi_insol'),
 Text(0, 8, 'koi_model_snr'),
 Text(0, 9, 'koi_tce_plnt_num'),
 Text(0, 10, 'koi_steff'),
 Text(0, 11, 'koi_srad'),
 Text(0, 12, 'koi_kepmag')]

In addition, we want to evaluate whether we can use clustering via latent factors to identify confirmed exoplanets.

In [30]:
from sklearn.decomposition import PCA

pca = PCA(2)
plt.scatter(*pca.fit_transform(X).T, c=U.argmax(axis=1), cmap='Set1')
plt.xlabel('PC1')
plt.ylabel('PC2')
print(set(U.argmax(axis=1)))
{0, 1, 6, 7, 8, 9, 10, 11, 12}
In [31]:
# Re-map koi_disposition to 1 only if the exoplanet is confirmed
all_df['LF'] = U.argmax(axis=1)
all_df.koi_disposition = all_df.koi_disposition.apply(lambda elem: 1 if elem == 2 else 0)
In [32]:
total = all_df.groupby(['LF']).koi_disposition.count()
confirmed = all_df.groupby(['LF']).koi_disposition.sum()
_df = pd.DataFrame({'confirmed': confirmed, 'total': total})
In [33]:
_df.plot.barh(width=0.85, figsize=(15, 5))
Out[33]:
<AxesSubplot:ylabel='LF'>
In [34]:
plt.figure(figsize=(15, 5))
plt.bar(list(confirmed.index), confirmed/total)
plt.xlabel('Latent Factors')
plt.ylabel('% of Confirmed Exoplanets')
Out[34]:
Text(0, 0.5, '% of Confirmed Exoplanets')

The low percentage of confirmed exoplanets in each cluster suggests that the clusters created through latent factors are not an effective way to evaluate whether a candidate is a confirmed exoplanet.

Results¶

Applying PCA to each disposition shows that confirmed exoplanets indeed have fundamentally different important features from those of candidate and false positive exoplanets. Our SVD results identify generally the same important features for exoplanets. Lastly, we saw that clustering by latent factors is not an effective way to identify confirmed exoplanets.

Recommendation¶

Our interpretations of the PCs likely correspond to physical quantities that can be measured or calculated. Adding those quantities to our data might help describe the different observations (confirmed, candidate, or false positive) that we have.

It was evident that the latent factors derived from our original features do not sufficiently describe our observations, i.e., the clusters contained a low proportion of confirmed exoplanets. Adding the newly interpreted features might yield better clustering results.

References and Acknowledgements¶

This research has made use of the NASA Exoplanet Archive, which is operated by the California Institute of Technology, under contract with the National Aeronautics and Space Administration under the Exoplanet Exploration Program.

Cote, Jackson (2022, April 26). Exploring the dangers of microplastics. Northeastern University College of Engineering. Retrieved Dec 5, 2022 from https://coe.northeastern.edu/news/exploring-the-dangers-of-microplastics/

European Geosciences Union (2018, August 30). Deadline for climate action: Act strongly before 2035 to keep warming below 2°C. ScienceDaily. Retrieved December 5, 2022 from www.sciencedaily.com/releases/2018/08/180830084818.htm